Code
library(tidyverse)
library(knitr)
library(broom)
expenditure <- read_csv("data/expenditure_per_student_primary_percent_of_gdp_per_person.csv")
completion <- read_csv("data/primary_completion_rate_total_percent_of_relevant_age_group.csv")library(tidyverse)
library(knitr)
library(broom)
expenditure <- read_csv("data/expenditure_per_student_primary_percent_of_gdp_per_person.csv")
completion <- read_csv("data/primary_completion_rate_total_percent_of_relevant_age_group.csv")For our project, we are using two distinct datasets to analyze the relationship between government expenditure on primary education (explantory variable) and primary school completion rates (response variable) across various countries. Our explanatory and response variables were both taken from Gapminder, which got its data from The World Bank, which collects its data from the statistical systems of its member countries (https://data.worldbank.org/about).
Dataset 1: Expenditure per Student, Primary (% of GDP per Person)
This dataset contains annual data on the percentage of GDP per capita that is spent on each primary school student. GDP per capita is the sum of all goods and services produced in a country (GDP) divided by the country’s population (per capita).
The variables included are:
Country: The name of the country.
1995-2018: 23 variables, 1995-2018, each representing the year of measured expenditure
The data spans a couple decades, allowing us to observe trends and changes in educational expenditure over time. Each cell represents the expenditure per student for that year and country. For instance, in 2000, Argentina allocated 12.3% of its GDP per capita to primary education per student, while by 2013, this figure had varied.
Dataset 2: Primary Completion Rate, Total (% of Relevant Age Group)
This dataset provides annual data on the primary school completion rate, indicating the percentage of primary schoolers that successfully complete primary education.
The variables included are:
Country: The name of the country.
1970-2022: The year for which the data is recorded.
The data covers a wide range of years, each cell represents completion percentage for students in primary school. For example, in 1980, Afghanistan had a completion rate of 27.7%, which changed significantly over the following decades.
By combining these datasets, we aim to explore the correlation between the financial investment in education and the corresponding completion rates across different countries for the time period of 2006-2016. This analysis will help us understand the impact of educational funding on student outcomes and provide useful insights on this relationship.
Hypothesized relationship between the variables (and any outside references). Discussion of data cleaning process and decisions.
Presented as a percentage of the GDP per capita, government expenditure per student is defined as the average government expenditure, spread across three domains (current expenditure, capital, and transfers), per student in the given level of education. It is hypothesized that primary school completion rates across various countries increase as the government expenditure per primary student increases. This hypothesis is based on the reasoning that greater expenditures can lead to better-trained and better-paid teachers, which directly impacts student performance and completion rates. Additionally, investments in school infrastructure create a conducive learning environment, reducing dropout rates and encouraging completion. Further, additional funding can provide students with textbooks, technology, and other learning resources that support their education and reduce barriers to completion. For instance, the National Center for Education Statistics reports that in the United States, between 2019 and 2020, 11% of current expenditures on education were allocated to food contracts, janitorial services, transportation, and professional development for teachers. These expenditures are crucial for maintaining a supportive and effective learning environment. Providing nutritious meals, ensuring clean and safe facilities, and offering reliable transportation all contribute to student well-being and readiness to learn. Professional development for teachers ensures they are well-equipped with the latest teaching strategies and knowledge, directly impacting student performance and completion rates. This example illustrates how targeted educational expenditures can foster an environment conducive to higher primary school completion rates. (https://nces.ed.gov/fastfacts/display.asp?id=66)
Presented as a percentage of the GDP per capita, government expenditure per student is defined as the average government expenditure, spread across three domains (current expenditure, capital, and transfers), per student in the given level of education. It is hypothesized that primary school completion rates across various countries increase as the government expenditure per primary student increases. This hypothesis is based on the reasoning that greater expenditures can lead to better-trained and better-paid teachers, which directly impacts student performance and completion rates. Additionally, investments in school infrastructure create a conducive learning environment, reducing dropout rates and encouraging completion. Further, additional funding can provide students with textbooks, technology, and other learning resources that support their education and reduce barriers to completion. For instance, the National Center for Education Statistics reports that in the United States, between 2019 and 2020, 11% of current expenditures on education were allocated to food contracts, janitorial services, transportation, and professional development for teachers. These expenditures are crucial for maintaining a supportive and effective learning environment. Providing nutritious meals, ensuring clean and safe facilities, and offering reliable transportation all contribute to student well-being and readiness to learn. Professional development for teachers ensures they are well-equipped with the latest teaching strategies and knowledge, directly impacting student performance and completion rates. This example illustrates how targeted educational expenditures can foster an environment conducive to higher primary school completion rates. (https://nces.ed.gov/fastfacts/display.asp?id=66)
We chose to limit our analysis to the period from 2006 to 2016, as this range contained the majority of the data with minimal missing values. For each year to be considered acceptable, both expenditure and completion rate needed to be recorded. Any year missing either of these metrics was excluded. Additionally, countries with fewer than 8 acceptable measurements within this timeframe were omitted to ensure that our analysis had a sufficient amount of data for each country.
Our data cleaning process involved transforming the datasets for expenditure per person and primary school completion rates into a long format for the selected years. We then merged these datasets by country and year. And then filtered the data as mentioned earlier.
We also summarized the dataset by calculating the average expenditure and completion rate for each country over the specified period.
# Pivot the data
expenditure_longer <- expenditure |>
select(country, `2006`:`2016`) |>
pivot_longer(cols = `2006`:`2016`,
names_to = "year",
values_to = "expenditure")
completion_longer <- completion |>
select(country, `2006`:`2016`) |>
pivot_longer(cols = `2006`:`2016`,
names_to = "year",
values_to = "completion_rate")
# Join the datasets
joined <- expenditure_longer |>
full_join(completion_longer,
join_by(country == country, year == year))
# Remove countries with too little data
ACCEPTABLE_THRESHOLD <- 8
joined <- joined |>
mutate(is_acceptable = !is.na(completion_rate) & !is.na(expenditure)) |>
group_by(country) |>
mutate(num_acceptable = sum(is_acceptable)) |>
filter(num_acceptable >= ACCEPTABLE_THRESHOLD &
is_acceptable == TRUE)
# Summarize the Data
avg_joined <- joined |>
summarize(avg_expenditure = mean(expenditure),
avg_completion = mean(completion_rate))
head(avg_joined)# A tibble: 6 × 3
country avg_expenditure avg_completion
<chr> <dbl> <dbl>
1 Argentina 14.1 103.
2 Belize 15.8 103.
3 Bulgaria 21.3 102.
4 Burkina Faso 21.2 50.4
5 Cape Verde 15.6 91.7
6 Colombia 15.4 112.
avg_joined |>
ggplot(aes(x=avg_expenditure, y=avg_completion)) +
geom_point(color = "steelblue") +
geom_smooth(method = "lm", color = "hotpink", size = 1.5) +
labs(title = "Completion Rate vs Expenditure Per Person (Primary School Level)",
x = "Expenditure (% of GDP Per Person)",
y = "",
subtitle = "Completion Rate (% of Primary Age Group)") #loading necessary libraries
library(countrycode)
library(gganimate)
library(gifski)
#adding continents variable
joined_conts <- joined |>
mutate(continent = countrycode(sourcevar = country, origin = "country.name", destination = "continent"))#dont run this chunk (animated graph) in rstudio - 100 graphs get downloaded
#only run this through rendered html
joined_conts |>
mutate(year = as.integer(year)) |>
ggplot(aes(expenditure, completion_rate, color = continent)) +
geom_point(alpha = 0.8, show.legend = TRUE) +
labs(title = "Completion Rate vs Expenditure Per Person from 2006 to 2016
(Primary School Level)",
subtitle = 'Year: {frame_time}
Completion Rate (% of Primary Age Group)',
x = "Expenditure (% of GDP Per Person)",
y = "",
color = "Continent") +
transition_time(year) +
ease_aes('linear') +
theme(plot.subtitle = element_text(size=10, vjust = -2),
axis.title.x = element_text(size=10),
plot.margin = margin(t = 20, r = 20, b = 20, l = 20, unit = "pt"))Linear regression is a statistical method used to model the relationship between a dependent variable (often denoted as Y) and one or more independent variables (often denoted as X1,X2,…Xp). The core idea behind linear regression is to fit a linear equation to observed data that best explains the relationship between the dependent variable and the independent variables.
Linear regression assumes that there is a linear relationship between the dependent and independent variables, the errors (residuals) are normally distributed, the variance of the errors is constant (homoscedasticity) and the errors are independent of each other.
model <- lm(avg_completion ~ avg_expenditure, data=avg_joined)
regression <- tidy(model)
kable(regression, caption = "Government Expenditure Regression Model")| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | 83.7383373 | 6.9876281 | 11.983800 | 0.0000000 |
| avg_expenditure | 0.4661166 | 0.3787711 | 1.230603 | 0.2243464 |
The estimated regression model is as follows: \[{\hat{avg\_completion}} = 83.74 + 0.466(avg\_expenditure)\]
The intercept represents the estimated value of avg_completion when avg_expenditure is zero. The estimate of 83.7383 suggests that if no money is spent (avg_expenditure = 0), the average completion rate is predicted to be 83.7383 percent.
The coefficient for avg_expenditure represents the change in the average completion rate for a one-unit increase in average expenditure. The estimate of 0.4661 suggests that for each additional percentage point of average expenditure, the average completion rate increases by 0.4661 percentage points.
Another way to analyze our regression model is to measure its variance. In the context of our regression model, variance measures how much the predicted primary school completion rates vary around the mean of these predicted values. A low variance indicates that the predicted values tend to be close to their mean, implying our model is consistently predicting accurate completion rates. On the other hand, a high variance indicates that the predicted values are spread out over a wider range, suggesting our model’s predictions are less accurate.
aug <- broom::augment(model)
variances <- data.frame(
Variable = c("Response Values",
"Fitted Values",
"Residuals"),
Variance = c(var(aug$avg_completion),
var(aug$.fitted),
var(aug$.resid)))
kable(variances, caption = "Regression Model Variances")| Variable | Variance |
|---|---|
| Response Values | 262.905600 |
| Fitted Values | 7.881709 |
| Residuals | 255.023891 |
With this variance table, we can calculate our R-squared value with the formula \[R^2 = \frac{{\text{Fitted Values Variance}}}{{\text{Response Values Variance}}}\]. Using this formula gives us an R-squared value of 0.02998. This means that 2.998% of the variance seen in primary school completion rate can be explained by government expenditure. This means that our linear model is not well-fit to predict primary school completion given government expenditure. This may be because, as seen in the first graph, there does not appear to be a linear correlation between our two variables. As such, a linear regression model would not be able to effectively predict one given the other.
With our simple linear regression, we can generate predictions using the predict() function. Then, we can add random errors to the predictions, using the residual standard error estimated from the linear regression model (acquired with sigma()).
pred_avg_completion <- predict(model, data = avg_joined)
est_sigma <- sigma(model)
rand_error <- function(x, mean = 0, sd){
sim <- x + rnorm(length(x), mean, sd)
return(sim)
}
set.seed(1234)
sim_response <- tibble(sim_avg_completion = rand_error(pred_avg_completion,
sd = est_sigma))Below is a scatter plot of the observed average completion versus the simulated average completion.
full_data <- avg_joined |>
select(avg_expenditure, avg_completion) |>
bind_cols(sim_response)
viz3 <- full_data |>
ggplot(aes(x = sim_avg_completion,
y = avg_completion)) +
geom_point() +
geom_abline(slope = 1,
intercept = 0,
color = "orange3",
linetype = "dashed") +
theme(aspect.ratio = 1) +
labs(x = "Simulated Average Completion",
y = "",
subtitle = "Observed Average Completion",
caption = "Viz 3")
viz3Here is a table of some summary statistics of a linear regression run on observed average completion versus simulated average completion:
fit <- lm(avg_completion ~ sim_avg_completion, data = full_data) |>
broom::glance()
fit <- fit |>
select("r.squared", "adj.r.squared", "sigma", "statistic", "p.value")
kable(fit,
col.names = c("R-squared",
"Adjusted R-squared",
"Sigma",
"F-statistic",
"p-value"),
caption = "Summary Statistics of Linear Regression Model",
format = "markdown")| R-squared | Adjusted R-squared | Sigma | F-statistic | p-value |
|---|---|---|---|---|
| 0.0169252 | -0.0031376 | 16.23978 | 0.8436109 | 0.3628653 |
R-squared (0.0169):
Adjusted R-squared (-0.0031):
Residual Standard Error (16.23978):
The analysis of these values indicates that while the simulated data captures the general trend and variability of the observed data, there are significant differences in terms of scatter distribution, presence of outliers, and central tendency.